The assembly was created using only the diploid samples, and with the Yukon population removed (given its apparent distance from the rest of the sexual populations). SNPs were then called on the diploid reference using Freebayes, which is ploidy aware. This generated >400k SNPs, which were whittled down to ~16k SNPs using the following filtering steps:
# big filter. This filters based on quality, minor allele count, minor allele frequency, average depth (low and high), excludes indels and SNPs with more than 2 alleles, and removes SNPs that have >5% missing data.
~/bcftools/bcftools filter TotalRawSNPs.vcf | bcftools filter -e "QUAL < 30 & MQM < 30 & MQMR < 30 || AVG(GQ) < 20 || F_MISSING > 0.05 || MAC < 3 || MAF <0.05 || AVG(FMT/DP)<20 || AVG(FMT/DP)>132.5" | bcftools view --max-alleles 2 --exclude-types indels --output-type v --output-file filter1.vcf
# allele balance
vcffilter -s -f "AB > 0.125 & AB < 0.875 | AB < 0.01" filter1.vcf > filter2.vcf
# SNPs found on both strands
vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s filter2.vcf > filter3.vcf
# ratio of mapping qualities between reference and alternate alleles
vcffilter -f "MQM / MQMR > 0.9 & MQM / MQMR < 1.05" filter3.vcf > filter4.vcf
# discrepancy in properly paired status of reads supporting reference or alternate alleles
vcffilter -f "PAIRED > 0.05 & PAIREDR > 0.05 & PAIREDR / PAIRED < 1.75 & PAIREDR / PAIRED > 0.25 | PAIRED < 0.05 & PAIREDR < 0.05" -s filter4.vcf > filter5.vcf
# ratio of locus quality score to depth
vcffilter -f "QUAL / DP > 0.25" filter5.vcf > final.filtered.snps.vcf
Checking for individual missingness after an early filtering step showed this:
It’s hard to see the names, but two of the samples from C59 (Yukon) and one from B46 have over 30% missingness. We should probably remove those before filtering, because they retain high(ish) missingness even after filtering:
After removing those three samples and filtering, our SNP count went from 16.2k to 17.1k (so not a huge difference). Here’s what it looks like:
Should we remove these individuals?
Using poppr, we were able to identify the same number of MLGs as the IBS (from SNPRelate), but automatically and with the full ploidy information. One thing to note is that it grouped the sexual Yukon pop into the same MLG, which makes sense given that they had higher IBS values.
Here’s a bunch of DAPC’s using different groups / colors.
Using MS as the two groups (Density plot of PC1)
Pops as groups, but colored by mating system
Pops as groups - all pops
Pops as groups - sub C59-S SM-A C23-A S03-A
Here’s a plot of the posterior membership probabilities:
This is hot off the press. I used a K-means approach from the adegenet package (following this tutorial: http://grunwaldlab.github.io/Population_Genetics_in_R/clustering_plot.html), which uses a clustering algorithm that is similar to STRUCTURE. Plot A below is based on 50 runs of the find.clusters algorithm, and shows that ~10-13 clusters have the lowest BIC. Plot B is a scatterplot of the discriminant functions (K=10), which helps us see how different the resulting clusters are. Plot C shows barplots of the posterior probability group assignments for K = 10-13, which helps visualize how the groups are assigned under different values of K. The K-means results are more-or-less consistent with our other findings, as they assign clones to pretty much the same groups.
Poppr function - genotype richness/diversity/evenness statsThis shows us several summary statistics:
| Abbreviation | Statistic |
|---|---|
Pop |
Population name. |
N |
Number of individuals observed. |
MLG |
Number of multilocus genotypes (MLG) observed. |
eMLG |
The number of expected MLG at the smallest sample size ≥ 10 based on rarefaction |
SE |
Standard error based on eMLG. |
H |
Shannon-Wiener Index of MLG diversity |
G |
Stoddart and Taylor’s Index of MLG diversity |
lambda |
Simpson’s Index |
E.5 |
Evenness, \(E_5\) |
Hexp |
Nei’s unbiased gene diversity |
Ia |
The index of association, \(I_A\) |
rbarD |
The standardized index of association |
Here are the results from the poppr function on the populations.
## Pop N MLG H G lambda E.5 Ia p.Ia rbarD p.rD
## 1 B42-S 3 3 1.099 3.00 0.667 1.000 9.401 0.001 1.68e-03 0.001
## 2 B46-S 3 3 1.099 3.00 0.667 1.000 154.674 0.001 2.52e-02 0.001
## 3 B49-S 3 3 1.099 3.00 0.667 1.000 25.641 0.001 4.39e-03 0.001
## 4 B53-S 3 3 1.099 3.00 0.667 1.000 2.368 0.031 4.05e-04 0.031
## 5 B60-S 3 3 1.099 3.00 0.667 1.000 2.806 0.030 5.59e-04 0.030
## 6 C59-S 3 1 0.000 1.00 0.000 NaN 75.414 0.001 3.18e-02 0.001
## 7 L05-S 3 3 1.099 3.00 0.667 1.000 522.314 0.001 7.98e-02 0.001
## 8 L08-S 3 3 1.099 3.00 0.667 1.000 3.491 0.015 4.91e-04 0.014
## 9 L10-S 3 3 1.099 3.00 0.667 1.000 3.801 0.017 5.11e-04 0.016
## 10 L11-S 3 3 1.099 3.00 0.667 1.000 -0.434 0.880 -5.62e-05 0.883
## 11 L12-S 3 3 1.099 3.00 0.667 1.000 0.195 0.330 2.71e-05 0.326
## 12 L13-S 3 3 1.099 3.00 0.667 1.000 5.724 0.001 8.04e-04 0.001
## 13 L45-S 1 1 0.000 1.00 0.000 NaN NA NA NA NA
## 14 L62-S 4 4 1.386 4.00 0.750 1.000 -0.264 0.723 -3.06e-05 0.728
## 15 C23-A 5 1 0.000 1.00 0.000 NaN 51.792 0.001 2.08e-02 0.001
## 16 C27-A 5 1 0.000 1.00 0.000 NaN 13.832 0.001 6.61e-03 0.001
## 17 C43-A 5 1 0.000 1.00 0.000 NaN 433.823 0.001 9.14e-02 0.001
## 18 C85-A 5 1 0.000 1.00 0.000 NaN 66.818 0.001 2.50e-02 0.001
## 19 C86-A 5 1 0.000 1.00 0.000 NaN 29.491 0.001 1.35e-02 0.001
## 20 C87-A 5 1 0.000 1.00 0.000 NaN 976.004 0.001 2.33e-01 0.001
## 21 C88-A 5 1 0.000 1.00 0.000 NaN 5.107 0.001 2.25e-03 0.001
## 22 L06-A 5 1 0.000 1.00 0.000 NaN 42.656 0.001 1.89e-02 0.001
## 23 L16-A 5 2 0.673 1.92 0.480 0.961 2870.226 0.001 3.10e-01 0.001
## 24 L17-A 5 1 0.000 1.00 0.000 NaN 11.574 0.001 5.73e-03 0.001
## 25 L39-A 5 2 0.500 1.47 0.320 0.725 3161.707 0.001 4.94e-01 0.001
## 26 L41-A 4 1 0.000 1.00 0.000 NaN 129.139 0.001 4.98e-02 0.001
## 27 L45-A 4 1 0.000 1.00 0.000 NaN 11.345 0.001 6.57e-03 0.001
## 28 L62-A 1 1 0.000 1.00 0.000 NaN NA NA NA NA
## 29 S03-A 4 1 0.000 1.00 0.000 NaN 7.464 0.001 3.80e-03 0.001
## 30 SM-A 5 1 0.000 1.00 0.000 NaN 121.604 0.001 2.86e-02 0.001
## 31 Total 114 49 3.096 9.30 0.892 0.393 5759.122 0.001 3.69e-01 0.001
Here’s a NJ tree on the populations. It uses provesti distance, which is probably biased, but it’s a start…
Here’s a slightly different one…